PVSOLA: A Phase Vocoder with Synchronized OverLap-Add
In this paper we present an original method mixing temporal and spectral processing to reduce the phasiness in the phase vocoder. Phasiness is an inherent artifact of the phase vocoder that appears when a sound is slowed down: the audio is perceived as muffled, reverberant and/or moving away from the microphone. It is due to a loss of coherence over time between the phases across the bins of the Short-Time Fourier Transform. Here the phase vocoder is used almost as usual, except that its phases are regularly reset in order to keep them coherent. A phase reset consists of using a frame from the input signal for synthesis without modifying it. The position of that frame in the output audio is adjusted using cross-correlation, as is done in many temporal time-stretching methods. The method is compared with three state-of-the-art algorithms. The results show a significant improvement over existing processes, although some test samples still present artifacts.
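To illustrate the cross-correlation step described above, the sketch below slides a candidate phase-reset frame around its nominal synthesis position and keeps the offset that maximizes the normalized cross-correlation with the audio already written to the output, in the spirit of WSOLA-style alignment. It is a minimal sketch, not the paper's implementation; the function name, the `tolerance` parameter and the buffer layout are all illustrative assumptions.

```python
import numpy as np

def best_synthesis_position(out_buf, frame, nominal_pos, tolerance):
    """Find where to place an unmodified (phase-reset) input frame.

    Slides `frame` around `nominal_pos` in the partially synthesized
    output buffer and returns the position maximizing the normalized
    cross-correlation with the samples already written there.
    Assumes nominal_pos - tolerance >= 0 and that the buffer extends
    at least len(frame) samples past nominal_pos + tolerance.
    """
    n = len(frame)
    best_pos, best_score = nominal_pos, -np.inf
    for pos in range(nominal_pos - tolerance, nominal_pos + tolerance + 1):
        overlap = out_buf[pos:pos + n]  # audio already synthesized here
        denom = np.linalg.norm(overlap) * np.linalg.norm(frame)
        score = np.dot(overlap, frame) / denom if denom > 0 else 0.0
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos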
Pure Data External for Reactive HMM-Based Speech and Singing Synthesis
In this paper, we present recent progress in the MAGE project. MAGE is a library for reactive HMM-based speech and singing synthesis. Here, it is integrated as a Pure Data external, called mage~, which provides reactive voice quality, prosody and identity manipulation combined with contextual control. mage~ brings together the high-quality, natural and expressive speech of HMM-based speech synthesis with high flexibility and reactive control over the speech production level. Such an object provides a basis for further research in gesturally-controlled speech synthesis: it is an object that can “listen” and reactively adjust itself to its environment. Building on mage~, we then create different interfaces and controllers in order to explore the real-time, expressive and interactive nature of speech.
Audio Time-Scaling for Slow Motion Sports Videos
Slow-motion videos are frequently featured during broadcasts of sports events. However, these videos do not feature any audio channel, apart from the live ambiance and comments from sports presenters. Standard audio time-scaling methods were not developed with such noisy signals in mind and do not always produce acceptable acoustic quality. In this work, we present a new approach that creates high-quality time-stretched versions of sports audio recordings while preserving all of their transient events.
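The abstract does not detail the algorithm, but the general idea of transient-preserving time-stretching can be sketched as follows: locate transients with an onset detector, keep a short guard window around each one unmodified, and stretch only the material in between. This is a hedged sketch assuming librosa's onset detection and time-stretching utilities, and is not the paper's method; the guard length and function name are illustrative.

```python
import numpy as np
import librosa

def stretch_preserving_transients(y, sr, rate):
    """Time-stretch `y` by `rate` (< 1 slows down) while leaving a short
    window around each detected transient untouched."""
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units='samples')
    guard = int(0.02 * sr)  # keep ~20 ms around each transient intact
    bounds = [0] + [int(o) for o in onsets] + [len(y)]
    out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        seg = y[start:end]
        if len(seg) < 2 * guard:
            out.append(seg)  # segment too short to stretch safely
            continue
        head, body, tail = seg[:guard], seg[guard:-guard], seg[-guard:]
        out.append(head)                                        # transient kept as-is
        out.append(librosa.effects.time_stretch(body, rate=rate))  # steady part stretched
        out.append(tail)
    return np.concatenate(out)
```

Note that with the guard windows excluded from stretching, the output duration only approximates len(y) / rate; a real system would compensate for this drift.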
Improving Synthesizer Programming From Variational Autoencoders Latent Space
Deep neural networks have recently been applied to the task of automatic synthesizer programming, i.e., finding optimal values of sound synthesis parameters in order to reproduce a given input sound. This paper focuses on generative models, which can infer parameters as well as generate new sets of parameters or perform smooth morphing effects between sounds. We introduce new models to ensure scalability and to increase performance by using heterogeneous representations of parameters as numerical and categorical random variables. Moreover, a spectral variational autoencoder architecture with multi-channel input is proposed in order to improve inference of parameters related to the pitch and intensity of input sounds. Model performance was evaluated according to several criteria, such as parameter estimation error and audio reconstruction accuracy. Training and evaluation were performed on a 30k-preset dataset, which is published with this paper. The results demonstrate significant improvements in parameter inference and audio accuracy, and show that the presented models can be used with subsets or full sets of synthesizer parameters.
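A minimal sketch of the heterogeneous parameter representation, assuming a PyTorch decoder head: continuous parameters are regressed as numerical values, while switch- or menu-style parameters each get their own softmax, and the training loss combines an MSE term with per-parameter cross-entropy terms. Layer sizes, cardinalities and names below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeterogeneousParamHead(nn.Module):
    """Decode a latent vector into synthesizer parameters, treating
    continuous parameters as numerical regression targets and discrete
    ones as categorical variables (one classifier per parameter)."""

    def __init__(self, latent_dim=256, n_numerical=100, cat_cardinalities=(4, 8, 13)):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU())
        self.num_out = nn.Linear(512, n_numerical)  # values mapped to [0, 1]
        self.cat_outs = nn.ModuleList(
            nn.Linear(512, c) for c in cat_cardinalities  # logits per categorical param
        )

    def forward(self, z):
        h = self.backbone(z)
        return torch.sigmoid(self.num_out(h)), [head(h) for head in self.cat_outs]

def param_loss(num_pred, cat_logits, num_target, cat_targets):
    """MSE on numerical parameters plus cross-entropy on each categorical one."""
    loss = F.mse_loss(num_pred, num_target)
    for logits, target in zip(cat_logits, cat_targets):
        loss = loss + F.cross_entropy(logits, target)
    return loss
```

Treating discrete controls with cross-entropy rather than regression avoids the model averaging between unrelated settings (e.g., two oscillator waveforms), which is one plausible reading of why the heterogeneous representation helps.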